WHAT IS CLAIMED IS:

1. A method of categorizing an initial collection of documents, each
document being represented by a string of characters, the method comprising
the steps of:

identifying predefined characters in the string of characters from the
documents in the initial collection of documents to form identified characters;

changing the identified characters in the documents in the initial collection
of documents to form a preprocessed collection of documents;

constructing a number of categories from the preprocessed collection of
documents; and

assigning each document in the preprocessed collection of documents to
a category to form a hierarchy of categories of documents.



2. The method of claim 1 wherein the step of constructing a number of
categories includes the steps of:

clearing a temporary category and selecting a seed document as a first
document of the temporary category;

collecting documents from the preprocessed collection of documents that
are similar to the seed document into the temporary category;

testing to determine if there are enough documents in the temporary
category to merit construction of a new category;

constructing the new category and generating a heading for the new
category if there are enough documents in the temporary category to merit
construction;

assigning the seed document to a category reserved for documents not
belonging to any specific category if there are not enough documents in the
temporary category; and

marking the documents assigned to any category in the preprocessed
collection of documents as processed.
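
The steps of claim 2 can be sketched as a loop, offered here only as an illustration. The function name `build_categories`, the caller-supplied `is_similar` predicate, the `min_size` threshold, and the use of the seed text as a placeholder heading are all assumptions; the claims do not fix any of them.

```python
def build_categories(docs, is_similar, min_size=2):
    """Illustrative sketch of the claim 2 loop.

    Returns (categories, uncategorized), where categories is a list of
    (heading, [document indices]) and uncategorized holds the indices of
    documents that did not merit a category of their own.
    """
    processed = [False] * len(docs)
    categories = []
    uncategorized = []  # category reserved for documents not belonging anywhere
    while not all(processed):
        # Clear the temporary category and select a seed document
        # (assumption: the first document not yet marked as processed).
        seed = processed.index(False)
        temp = [seed]
        # Collect unprocessed documents that are similar to the seed.
        for i, doc in enumerate(docs):
            if i != seed and not processed[i] and is_similar(docs[seed], doc):
                temp.append(i)
        if len(temp) >= min_size:
            # Enough documents: construct the category with a heading
            # (placeholder heading: the seed document's text).
            categories.append((docs[seed], temp))
            assigned = temp
        else:
            # Not enough documents: only the seed goes to the reserved category.
            uncategorized.append(seed)
            assigned = [seed]
        # Mark every document assigned in this pass as processed.
        for i in assigned:
            processed[i] = True
    return categories, uncategorized
```

Because the seed is always marked as processed, each pass removes at least one document from consideration, so the loop terminates.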

3. The method of claim 2 wherein the predefined characters include
punctuation marks, and the changing step removes the punctuation marks from
the string of characters.

4. The method of claim 2 wherein the predefined characters include
upper-case characters, and the changing step replaces upper-case characters
with lower-case characters.

5. The method of claim 2 wherein the predefined characters include
non-root words, and the changing step replaces the non-root words with root
words.

6. The method of claim 2 wherein the predefined characters include
abbreviations, and the changing step replaces the abbreviations with original
words.

7. The method of claim 2 wherein the predefined characters include
articles, and the changing step removes the articles from the string of characters.
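
The changing steps of claims 3 to 7 can be illustrated with a minimal sketch. Each of those claims stands on its own; they are combined below only for brevity. The lookup tables `ABBREVIATIONS` and `ROOT_WORDS` and the set `ARTICLES` are hypothetical stand-ins, since the claims do not specify how root words or abbreviations are resolved.

```python
import string

# Hypothetical lookup tables; their contents are assumptions for illustration.
ABBREVIATIONS = {"dept": "department", "intl": "international"}
ROOT_WORDS = {"running": "run", "documents": "document"}
ARTICLES = {"a", "an", "the"}


def preprocess(text: str) -> str:
    """Apply the character changes of claims 3 to 7 to one document string."""
    # Claim 4: replace upper-case characters with lower-case characters.
    text = text.lower()
    # Claim 3: remove punctuation marks.
    text = text.translate(str.maketrans("", "", string.punctuation))
    words = []
    for word in text.split():
        # Claim 6: replace abbreviations with original words.
        word = ABBREVIATIONS.get(word, word)
        # Claim 5: replace non-root words with root words.
        word = ROOT_WORDS.get(word, word)
        # Claim 7: remove articles.
        if word not in ARTICLES:
            words.append(word)
    return " ".join(words)
```

For example, `preprocess("The Intl. Dept. is Running!")` yields `"international department is run"` under these tables.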

8. The method of claim 2 wherein the collecting step further includes
the step of loading a character string from the seed document into a memory
location to initialize the values of a number of category properties for the
temporary category.

9. The method of claim 8 and further comprising the steps of:

determining if there are documents in the preprocessed collection of
documents that have not been processed with respect to the temporary category;

if there are documents in the preprocessed collection of documents that
have not been processed with respect to the temporary category, selecting a
next document from the preprocessed collection of documents and measuring a
similarity with a similarity test between the selected document and a number of
current category properties;

including the selected document in the temporary category if the selected
document passes the similarity test;

updating the values of the number of category properties of the temporary
category when the selected document is included; and

rejecting the selected document if the selected document fails the
similarity test.

10. The method of claim 9 and further comprising the step of repeating
the steps of claim 9 for all documents in the preprocessed collection of documents.
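
As an illustration of claims 8 to 10, the sketch below keeps a single category property, the longest common substring of the titles seen so far, and updates it each time a document is included. The similarity test (a shared substring of at least `min_len` characters), the function names, and the use of Python's `difflib` are assumptions made for the example and are not taken from the claims.

```python
from difflib import SequenceMatcher


def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


def collect(seed_title, candidate_titles, min_len=4):
    """Collect titles similar to the seed into a temporary category."""
    # Claim 8: initialize the category property from the seed's character string.
    common = seed_title
    members, rejected = [seed_title], []
    # Claim 10: repeat the steps for all candidate documents.
    for title in candidate_titles:
        shared = longest_common_substring(common, title)
        # Hypothetical similarity test against the current category property.
        if len(shared.strip()) >= min_len:
            members.append(title)
            common = shared  # claim 9: update the category property
        else:
            rejected.append(title)
    return common, members, rejected
```

Because the property shrinks to what all included titles share, later candidates are tested against the category as a whole and not against the seed alone.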

11. The method of claim 2 wherein the collecting step further includes
the step of collecting more similar documents from a number of existing
categories.

12. The method of claim 11 and further comprising the steps of:

determining if there are more documents in a number of existing
categories that have not been processed with respect to the temporary category;

if there are documents in the number of existing categories that have not
been processed with respect to the temporary category, selecting a next
document from the number of existing categories as a selected document and
measuring a similarity with a similarity test between the selected document and a
number of current category properties;

including the selected document in the temporary category if the selected
document passes the similarity test; and

rejecting the selected document if the selected document fails the
similarity test.

13. The method of claim 12 and further comprising the step of
repeating the steps of claim 12 for all documents in the number of existing
categories.

14. The method of claim 8 wherein the category properties include a
string of characters selected from the group consisting of a longest common
substring in the title, a longest common substring in the body, and a document type
index measured as a list of fractional numbers for each document type.

15. The method of claim 14 wherein a document type includes types
selected from the group consisting of news article, technical documents, and
poems.

16. The method of claim 2 and further comprising the steps of:

making sub-categories if there are too many documents in a given
category; and

post-processing the number of categorized lists of documents.

17. The method of claim 16 wherein the categorized list of documents
is post-processed by the following steps:

merging two categories that each have a heading where there is too much
overlap in the headings of the two categories; and

promoting sub-categories to an upper level in a hierarchy when there are
not enough categories in the upper level.
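
The two post-processing steps of claim 17 can be sketched as follows. The overlap measure (the fraction of shared words between two headings), the thresholds `max_overlap` and `min_top`, and the dictionary layout are assumptions chosen for the example; the claims do not say how overlap is measured or what counts as "not enough" categories.

```python
def heading_overlap(h1: str, h2: str) -> float:
    """Fraction of words shared by two headings (an assumed overlap measure)."""
    w1, w2 = set(h1.split()), set(h2.split())
    union = w1 | w2
    return len(w1 & w2) / len(union) if union else 0.0


def merge_overlapping(categories, max_overlap=0.4):
    """Merge categories whose headings overlap too much.

    categories maps a heading to its list of documents.
    """
    merged = {}
    for heading, docs in categories.items():
        for kept in merged:
            if heading_overlap(kept, heading) > max_overlap:
                # Fold this category into the one already kept.
                merged[kept].extend(d for d in docs if d not in merged[kept])
                break
        else:
            merged[heading] = list(docs)
    return merged


def promote_if_sparse(top_level, min_top=2):
    """Promote sub-categories when the upper level has too few categories.

    top_level maps a heading to a dict of sub-heading -> documents.
    """
    if len(top_level) >= min_top:
        return top_level
    promoted = {}
    for subs in top_level.values():
        promoted.update(subs)
    return promoted
```

In this sketch the first heading encountered survives a merge; a real system might instead regenerate the heading from the merged documents.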

18. The method of claim 2 wherein the seed document is a first
document in the preprocessed collection of documents.

19. The method of claim 2 wherein the seed document is a document
with a highest rank value among the documents not marked as processed in the
preprocessed collection of documents.

20. The method of claim 2 wherein the temporary category is tested to
determine if there are enough documents in the temporary category to merit
construction of a new category by accumulating the weight of each document
when each document can contribute uniform weight or different weight based on
the rank value of each document with higher ranked documents given more
weight.
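
The test of claim 20 amounts to comparing an accumulated weight against a threshold. A minimal sketch, assuming a hypothetical `threshold` parameter and assuming that the rank value itself serves as the rank-based weight:

```python
def merits_new_category(rank_values, threshold, use_rank=True):
    """Decide whether a temporary category merits construction.

    rank_values holds one rank value per document in the temporary category.
    With use_rank=False every document contributes a uniform weight of 1.0.
    """
    if use_rank:
        # Higher ranked documents contribute more weight.
        weights = list(rank_values)
    else:
        weights = [1.0] * len(rank_values)
    return sum(weights) >= threshold
```

With rank-based weights, a few highly ranked documents can justify a category that many low-ranked documents would not.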

21. The method of claim 2 wherein the heading is a longest common
substring in a title.

22. The method of claim 21 wherein the heading includes a number of
longest common substrings.

23. The method of claim 1 and further comprising the steps of:

determining if an anchor-text character string is available for the
documents in the initial collection of documents; and

attaching an anchor-text character string to the string of characters that
represents the documents in the initial collection of documents when the
anchor-text character string is available.

24. The method of claim 23 wherein the anchor-text character string is
a text used most frequently by hypertext documents.

25. The method of claim 23 wherein the anchor-text character string is
a text with a highest partial extrinsic rank value.
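
Claims 23 and 24 can be illustrated with a short sketch that attaches the most frequently used anchor text to a document's string. The function name `attach_anchor_text` and the space-separated concatenation are assumptions; the selection by partial extrinsic rank value in claim 25 is not shown because the claims do not define how that value is computed.

```python
from collections import Counter


def attach_anchor_text(doc_text, anchor_texts):
    """Append the most frequently used anchor text, when one is available.

    anchor_texts lists the anchor strings that hypertext documents use
    when linking to this document; it may be empty.
    """
    if not anchor_texts:
        # No anchor text available: leave the document string unchanged.
        return doc_text
    most_common, _count = Counter(anchor_texts).most_common(1)[0]
    return f"{doc_text} {most_common}"
```

For example, `attach_anchor_text("home page", ["acme", "acme corp", "acme"])` returns `"home page acme"`.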

26. A method of categorizing an initial collection of documents, each
document being represented by a string of characters, the method comprising
the steps of:

constructing a number of categories from the initial collection of
documents wherein a category is constructed by:

clearing a temporary category and selecting a seed document as a
first document of a temporary category;

collecting documents from the initial collection of documents to the
temporary category that are similar to the seed document;

testing to determine if there are enough documents in the
temporary category to merit construction of a new category;

constructing the new category and generating a heading for the
new category if there are enough documents in the temporary category to merit
construction;

assigning the seed document to a category reserved for documents
not belonging to any specific category if there are not enough documents in the
temporary category; and

marking the documents assigned to any category in the initial
collection of documents as processed; and

assigning each document in the initial collection of documents to a
category to form a hierarchy of categories of documents.

27. The method of claim 26 wherein the collecting step further includes
the step of loading a character string from the seed document into a memory
location to initialize values of a number of category properties for the temporary
category.

28. The method of claim 27 and further comprising the steps of:

determining if there are documents in the initial collection of documents
that have not been marked as processed;

if there are documents in the initial collection of documents that have not
been marked as processed, selecting a next document from the initial collection
of documents and measuring a similarity with a similarity test between the
selected document and a number of current category properties;

including the selected document in the temporary category if the selected
document passes the similarity test; and

rejecting the selected document if the selected document fails the
similarity test.

29. The method of claim 28 and further comprising the step of
repeating the steps of claim 28 for all documents in the initial collection of
documents.

30. The method of claim 26 wherein the collecting step further includes
the step of collecting more similar documents from a number of existing
categories.

31. The method of claim 30 and further comprising the steps of:

determining if there are more documents in the number of existing
categories that have not been processed with respect to the temporary category;

if there are documents in the number of existing categories that have not
been processed with respect to the temporary category, selecting a next
document from the number of existing categories and measuring a similarity with
a similarity test between the selected document and a number of current
category properties;

including the selected document in the temporary category if the selected
document passes the similarity test; and

rejecting the selected document if the selected document fails the
similarity test.

32. The method of claim 31 and further comprising the step of
repeating the steps of claim 31 for all documents in the number of existing
categories.

33. The method of claim 1 wherein each document in the preprocessed
collection of documents is assigned to one or more categories to form a
hierarchy of categories.

34. The method of claim 26 wherein each document in the initial
collection of documents is assigned to one or more categories to form a
hierarchy of categories.

35. The method of claim 2 and further comprising the step of repeating
the steps of claim 2 until all documents in the preprocessed collection of
documents are marked as assigned to a category.

36. The method of claim 35 wherein the documents in the
preprocessed collection of documents are initialized as unmarked before
selecting a first seed document.

37. The method of claim 26 and further comprising the step of
repeating the constructing steps of claim 26 until all documents in the initial
collection of documents are marked as assigned to a category.

38. The method of claim 37 wherein the documents in the
preprocessed collection of documents are initialized as unmarked before
selecting a first seed document.

39. An apparatus that categorizes a collection of documents, each
document being represented by a string of characters, the apparatus comprising:

means for identifying predefined characters in the string of characters from
each document to form identified characters;

means for changing the identified characters in each document to form a
preprocessed collection of documents;

means for constructing a number of categories from the preprocessed
collection of documents; and

means for assigning each document in the preprocessed collection of
documents to a category to form a number of categorized lists of documents.